# Set global chunk obptions
knitr::opts_chunk$set(fig.width=12, fig.height=12, warning = F)
# Data
library(openxlsx)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# Tables
library(table1)
#>
#> Attaching package: 'table1'
#> The following objects are masked from 'package:base':
#>
#> units, units<-
# Plots
library(ggplot2)
library(ggplotify)
# Load Metabolomics Pipeline
library(MetabolomicsPipeline)
# To upload image
library(magick)
#> Linking to ImageMagick 6.9.12.96
#> Enabled features: cairo, freetype, fftw, ghostscript, heic, lcms, pango, raw, rsvg, webp
#> Disabled features: fontconfig, x11
The purpose of the MetabolomicPipeline package is to provide additional tools to complement the analysis done by Metabolon. This package allows us to create a workflow that contains six different chunks.
Peak normalization and standardization
Analysis data creation
Data exploration
Subpathway Analysis
Pairwise comparisons
Box plot and line plots
In this vignette, we will use data which consists of 86 samples (42 males, 44 females), three treatment groups, and the samples were taken at three different time points. In the original experiment, the sex of each sample was not captured. Therefore, we have an additional metadata .xlsx file which contains the additional sex variable, as well as a variable called “PARENT_SAMPLE_NAME, which allows us to link the extra metadata to the sample metadata.
Metabolon includes normalized peak data as part of the Metabolon Excel file. We recommend you use the normalized peak data from Metabolon; therefore, the normalization and standardization steps are optional. However, in some cases, you may need to perform normalization differently than Metabolon provided. An example of this is experiments that include multiple batches. Metabolon will perform the normalization for each batch separately, which can cause downstream issues. When analyzing multiple batches we recommend combining the raw data and performing one normalization on all batches.
In its raw form, the peak data contain counts for each metabolite in each sample. Standardization and normalization are essential steps to improve the signal-to-noise ratio. We implement normalization and standardization by:
Standardizing each metabolite by the metabolites median value. (median_standardization)
Imputing missing values for each metabolite by the minimum value of the metabolite. (min_val_impute)
Log transform each value (log_trans_met)
In the chunk below, we load the raw peak data from the Metabolon Excel sheet and perform the normalization and standardization steps. Each step only requires the output from the previous step.
###############################################################################
##### Load Data ###############################################################
###############################################################################
# Provide a path to Metabolon .xlsx file.
metabolon_path <- "../data/UNAZ-0501-22VW_ DATA TABLES.xlsx"
# Load raw peak data
peak_data <- read.xlsx(metabolon_path, sheet = "Peak Area Data")
###############################################################################
###### Standardization and Normalization ######################################
###############################################################################
# 1 Median standardization
med_std <- median_standardization(peak_data = peak_data)
# 2 Minimum Value imputation
impute <- min_val_impute(med_std)
# 3 Log Transformation
log_trans_met <- log_transformation(impute)
We receive multiple datasets from Metabolon, each on a different tab of the Metabolon Excel file. The datasets needed for the downstream analysis are the sample metadata, chemical annotation, and log-transformed peak data. The table below shows the tab that each dataset is on within the Metabolon Excel file.
| Data | Tab of excel file |
|---|---|
| Sample metadata | Sample Meta Data |
| Chemical annotation | Chemical Annotation |
| Log-transformed peak data | Log Transformed Data |
The MetabolomicsPipeline package provides a convenient way to load each of these datasets together using the “loadMetPipeFromMetabolon” function, which only requires you to provide the path to the Metabolon Excel file.
In addition to the data from the Metabolon Excel file, we will also create an analysis dataset, which contains the analysis variables from the sample metadata and the log-transformed peak data.
We take the following steps to create the analysis data.
Enter metadata variables for analysis.
Create the analysis data and add it to the rest of the experiment data.
Create a table showing the sample distribution.
################################################################################
### Load Data ##################################################################
################################################################################
# Load Metabolon data from the Excel file
dat <- loadMetPipeFromMetabolon(metabolon_path = "../data/UNAZ-0501-22VW_ DATA TABLES.xlsx" )
# load additional metadata
meta_data_additional <- read.xlsx("../data/AdditionalVars.xlsx")
# 2. Merge additional vars to the meta data
if(nrow(meta_data_additional)>0){
meta_data <- meta_data_additional %>%
left_join(dat@meta,"PARENT_SAMPLE_NAME")
# Update meta data slot
dat@meta <- meta_data
}
################################################################################
### Enter Metadata Variables For Analysis ######################################
################################################################################
metadata_variables <- c("PARENT_SAMPLE_NAME",
"GROUP_NAME",
"TIME1",
"Gender")
################################################################################
### Add Analysis Data To MetPipe Object ########################################
################################################################################
# 2. Create analysis data
dat@analysis <- dat@meta %>%
select(all_of(metadata_variables)) %>%
left_join(dat@standardized_peak,"PARENT_SAMPLE_NAME")
################################################################################
### Create Table 1 #############################################################
################################################################################
# 5. Create table 1
tbl1 <- table1(~ TIME1 + GROUP_NAME| Gender
, data = dat@analysis)
# 6. Display table 1
tbl1
| Female (N=44) |
Male (N=42) |
Overall (N=86) |
|
|---|---|---|---|
| TIME1 | |||
| End | 14 (31.8%) | 15 (35.7%) | 29 (33.7%) |
| Onset | 15 (34.1%) | 14 (33.3%) | 29 (33.7%) |
| PreSymp | 15 (34.1%) | 13 (31.0%) | 28 (32.6%) |
| GROUP_NAME | |||
| 1902 | 15 (34.1%) | 14 (33.3%) | 29 (33.7%) |
| Control | 14 (31.8%) | 14 (33.3%) | 28 (32.6%) |
| WT | 15 (34.1%) | 14 (33.3%) | 29 (33.7%) |
# Save data
# saveRDS(dat, file = "data/dat.Rds")
In data exploration, we use several methods to help us better understand the underlying patterns in the data without using a formal hypothesis test. In this pipeline, we are going to focus on two methods of data exploration:
A.) Principal component analysis
B.) Heatmaps
In general, Principal component analysis (PCA) reduces the number of variables in a dataset while preserving as much information from the data as possible. At a high level, PCA is constructed such that the first principal component (PC) accounts for the largest amount of variance within the data. The second PC accounts for the largest remaining variance, and so on. Additionally, each of the PCs produced by PCA is uncorrelated with the other principal components. PCA can allow us to visualize sources of variation in the data. The metabolite_pca function will enable us to specify a sample metadata variable to label the points in the plot.
Suppose you notice a variable with clearly separated groups that is not a variable of interest. In that case, consider stratifying your downstream analysis by the values of that variable. For example, we will stratify the downstream analysis by male/female in our vignette data.
For our heatmap, the x-axis will be the samples, and the y-axis will be the metabolites. The values determining the colors will be the log normalized peak values for each metabolite in each observation. We can group the observations by the experimental conditions. Grouping the experimental conditions in a heatmap is another way of visualizing sources of variation within our data.
We can use the metabolite_heatmap function to create the heatmaps, which requires the following arguments.
MetPipe: This is the experiment data
top_mets: The number of metabolites to include in the heatmap. The metabolites are chosen based on the metabolites with the highest variability.
group_vars: The variables to group the samples by. caption: The title of the heatmap.
… : You can add additional arguments to order the samples
In the chunk below, we create a PCA plot labeled by Gender. Then, we make three heatmaps increasing by complexity.
###############################################################################
### Run PCA ###################################################################
###############################################################################
# Define PCA label from metadata
meta_var = "Gender"
# Run PCA
pca <- metabolite_pca( dat,
meta_var = meta_var)
# Show heatmap
pca
################################################################################
### Run Heatmaps ###############################################################
################################################################################
# Heatmap with one group
treat_heatmap <- metabolite_heatmap(dat,top_mets = 50,
group_vars = "GROUP_NAME",
strat_var = NULL,
caption = "Heatmap Arranged By Group",
GROUP_NAME)
as.ggplot(treat_heatmap)
# Heatmap with two groups
treatandtime <- metabolite_heatmap(dat,top_mets = 50,
group_vars = c("GROUP_NAME","TIME1"),
strat_var = NULL,
caption = "Heatmap Arranged By Group and TIME",
GROUP_NAME, desc(TIME1))
as.ggplot(treatandtime)
# Heatmap with 2 group and stratified
strat_heat <- metabolite_heatmap(dat,top_mets = 50,
group_vars = c("GROUP_NAME","TIME1"),
strat_var = "Gender",
caption = "Heatmap Arranged By Group and TIME",
GROUP_NAME, desc(TIME1))
## Female
as.ggplot(strat_heat$Female)
# Male
as.ggplot(strat_heat$Male)
In the chemical annotation file, we will see that each metabolite is within a sub-pathway, and each subpathway is within a superpathway. There are several metabolites within each subpathway and several sub-pathways within each Super-pathway. We can utilize an Analysis of variance (ANOVA) model to test for a difference in peak intensities between the treatment groups at the metabolite level, which is already part of the Metabolon analysis. However, since multiple metabolites are within a sub-pathway, it is challenging to test if the treatment affected the peak data at the sub-pathway level. For this, we utilize a combined Fisher probability test. The combined Fisher test combines the p-values from independent tests to test the hypothesis for an overall effect. The Combined Fisher Probability is helpful for testing a model at the subpathway level based on the pvalues from the model at the metabolite level.
We will test at the subpathway level by combining the p-values for each metabolite within the subpathway for each model. We use a combination function given by \(\tilde{X}\) which combines the pvalues, resulting in a chi-squared test statistic.
\[ \tilde{X} = -2\sum_{i=1}^k ln(p_i) \] where \(k\) is the number of metabolites in the subpathway. We can get a p-value from \(P(X \geq\tilde{X})\), knowing that \(\tilde{X}\sim \chi^2_{2k}\). You will notice that smaller p-values will lead to a larger \(\tilde{X}\).
Since we are first testing each metabolite utilizing ANOVA, we make the following assumptions for each metabolite,
Independence: Each observation is independent of all other observations. Therefore, if you have collected multiple samples from the same subject then this assumption may be violated.
Normality: The metabolite log-scaled intensities follow a normal distribution within each of the treatment groups.
Homoscedasticity: Equal variance between treatment groups.
In addition to the assumptions in the ANOVA models at the metabolite level, the Fisher’s Combined probability places an independence assumption between the metabolites within the subpathway.
For more about the Combined Fisher Probability and other methods that can address this problem, see:
Loughin, Thomas M. “A systematic comparison of methods for combining p-values from independent tests.” Computational statistics & data analysis 47.3 (2004): 467-485.
To test our hypothesis at the subpathway level, we first have to form our hypothesis at the metabolite level. For each metabolite, we test three models.
1.) Interaction: \(log Peak = Treatment Group + Time + Treatment*Time\)
2.) Parallel: \(log Peak = Treatment Group + Time\)
3.) Single: \(log Peak = Treatment\)
For the interaction model, we are focusing only on the interaction term “Treatment*Time” to test if there is a significant interaction between our treatment and the time variable. The parallel model is testing if the time variable is explaining a significant amount of the metabolite variance, and the treatment model is testing if the treatment explains a significant proportion of the variance for each metabolite.
We test at the subpathway level using the Combined Fisher Probability method to combine the p-values from each model for all metabolites within the subpathway. To run the subpathway analysis, we use the “subpathway_analysis” function, which requires the following arguments.
MetPipe: The experiment data.
treat_var: The treatment variable of interest.
block_var: The block variable, in our example the block variable is going to be time.
strat_var: The name of the variable we want to stratify our analysis by. In our this is going to be “Gender”.
With the MetabolomicsPipeline package, we provide three different ways to summarize the results from the subpathway analysis.
Number of significant subpathways by model type (subpath_by_model)
Percentage of significant subpathways within superpathways (subpath_within_superpath)
Metabolite model results within a specified subpathway (met_within_sub)
################################################################################
## Stratified Analysis #########################################################
################################################################################
# Stratified Analysis
stratified = subpathway_analysis(dat,
treat_var = "GROUP_NAME",
block_var = "TIME1",
strat_var = "Gender")
#> [1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924625"
#> [1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924625"
#> [1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924625"
#> [1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999925396"
#> [1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999925396"
#> [1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999925396"
#> [1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 100001394"
#> [1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 100001394"
#> [1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 100001394"
#> [1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924548"
#> [1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924548"
#> [1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924548"
#> [1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924748"
#> [1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924748"
#> [1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924748"
################################################################################
### Results Plots ##############################################################
################################################################################
# 1. significant subpathways by model type
subpath_by_model(stratified)
| Model Type | Female | Male |
|---|---|---|
| Interaction | 6 | 84 |
| Parallel | 20 | 12 |
| Single | 19 | 2 |
| None | 65 | 12 |
# 2. Percentage of signficant subpathways within superpathways
subpath_within_superpath(stratified)
| Super Pathway | Percent Significant (Female) | Percent Significant (Male) |
|---|---|---|
| Xenobiotics | 5 / 5 (100%) | 5 / 5 (100%) |
| Amino Acid | 13 / 16 (81.25%) | 16 / 16 (100%) |
| Peptide | 3 / 5 (60%) | 3 / 5 (60%) |
| Nucleotide | 3 / 8 (37.5%) | 8 / 8 (100%) |
| Lipid | 16 / 53 (30.19%) | 46 / 53 (86.79%) |
| Carbohydrate | 2 / 8 (25%) | 7 / 8 (87.5%) |
| Cofactors and Vitamins | 2 / 11 (18.18%) | 9 / 11 (81.82%) |
| Energy | 0 / 2 (0%) | 2 / 2 (100%) |
| Partially Characterized Molecules | 0 / 1 (0%) | 1 / 1 (100%) |
# 3. Metabolites within subpathway
tables <- met_within_sub(stratified, subpathway = "Partially Characterized Molecules")
### Females
tables[[1]]
| Metabolite Name | Interaction P-Value | Parallel P-Value | Single P-Value |
|---|---|---|---|
| glycolate (hydroxyacetate) | 0.437 | 0.435 | 0.408 |
| iminodiacetate (IDA) | 0.936 | 0.373 | 0.017 |
| EDTA | 0.709 | 0.269 | 0.098 |
| ectoine | 0.833 | 0.742 | 0.572 |
| sulfate* | 0.596 | 0.880 | 0.858 |
| ethyl glucuronide | 0.293 | 0.891 | 0.329 |
| dimethyl sulfone | 0.703 | 0.350 | 0.908 |
| 3-acetylphenol sulfate | 0.533 | 0.033 | 0.018 |
| benzoylcarnitine* | 0.594 | 0.271 | 0.067 |
| S-(3-hydroxypropyl)mercapturic acid (HPMA) | 0.556 | 0.346 | 0.055 |
| O-sulfo-tyrosine | 0.558 | 0.632 | 0.086 |
| 3-hydroxypyridine sulfate | 0.470 | 0.163 | 0.001 |
| 6-hydroxyindole sulfate | 0.266 | 0.292 | 0.051 |
| 1,2,3-benzenetriol sulfate (2) | 0.208 | 0.097 | 0.000 |
| thioproline | 0.493 | 0.113 | 0.387 |
| 4-acetamidobenzoate | 0.069 | 0.026 | 0.084 |
| perfluorooctanesulfonate (PFOS) | 0.397 | 0.010 | 0.199 |
| methylnaphthyl sulfate (1)* | 0.443 | 0.670 | 0.065 |
| methylnaphthyl sulfate (2)* | 0.648 | 0.633 | 0.002 |
| 2,2’-methylenebis(6-tert-butyl-p-cresol) | 0.011 | 0.000 | 0.100 |
| 3-hydroxy-2-methylpyridine sulfate | 0.392 | 0.083 | 0.001 |
| 3,5-dichloro-2,6-dihydroxybenzoic acid | 0.442 | 0.002 | 0.253 |
| 3-bromo-5-chloro-2,6-dihydroxybenzoic acid* | 0.613 | 0.008 | 0.208 |
| 4-chlorobenzoic acid | 0.364 | 0.538 | 0.664 |
| perfluorohexanesulfonate (PFHxS) | 0.529 | 0.123 | 0.230 |
### Males
tables[[2]]
| Metabolite Name | Interaction P-Value | Parallel P-Value | Single P-Value |
|---|---|---|---|
| glycolate (hydroxyacetate) | 0.054 | 0.069 | 0.181 |
| iminodiacetate (IDA) | 0.052 | 0.954 | 0.001 |
| EDTA | 0.116 | 0.987 | 0.005 |
| ectoine | 0.492 | 0.019 | 0.284 |
| sulfate* | 0.315 | 0.247 | 0.891 |
| ethyl glucuronide | 0.032 | 0.114 | 0.035 |
| dimethyl sulfone | 0.002 | 0.021 | 0.236 |
| 3-acetylphenol sulfate | 0.160 | 0.566 | 0.837 |
| benzoylcarnitine* | 0.378 | 0.272 | 0.055 |
| S-(3-hydroxypropyl)mercapturic acid (HPMA) | 0.033 | 0.027 | 0.113 |
| O-sulfo-tyrosine | 0.162 | 0.020 | 0.618 |
| 3-hydroxypyridine sulfate | 0.095 | 0.100 | 0.161 |
| 6-hydroxyindole sulfate | 0.064 | 0.069 | 0.498 |
| 1,2,3-benzenetriol sulfate (2) | 0.293 | 0.361 | 0.458 |
| thioproline | 0.004 | 0.931 | 0.371 |
| 4-acetamidobenzoate | 0.801 | 0.190 | 0.253 |
| perfluorooctanesulfonate (PFOS) | 0.058 | 0.122 | 0.387 |
| methylnaphthyl sulfate (1)* | 0.042 | 0.064 | 0.055 |
| methylnaphthyl sulfate (2)* | 0.033 | 0.013 | 0.067 |
| 2,2’-methylenebis(6-tert-butyl-p-cresol) | 0.006 | 0.000 | 0.419 |
| 3-hydroxy-2-methylpyridine sulfate | 0.258 | 0.095 | 0.369 |
| 3,5-dichloro-2,6-dihydroxybenzoic acid | 0.043 | 0.002 | 0.251 |
| 3-bromo-5-chloro-2,6-dihydroxybenzoic acid* | 0.044 | 0.020 | 0.135 |
| 4-chlorobenzoic acid | 0.933 | 0.343 | 0.617 |
| perfluorohexanesulfonate (PFHxS) | 0.026 | 0.486 | 0.049 |
We can look at the pairwise comparisons for all experimental groups at the metabolite level. Metabolon includes this as part of their analysis. However, if you need to change the model, you must rerun the pairwise analysis. We will use the metabolite_pairwise function within the MetabolomicsPipeline package, which requires the following arguments:
MetPipe: The experiment data.
form: The pairwise_analysis function uses the model \(log Peak = form\). Therefore, we must specify the right-hand side of the model. In our example, we will use form = “GROUP_NAME + TIME1 + GROUP_NAME*TIME1”.
We will produce a heatmap of the log fold changes for the metabolites with a significant overall p-value (which tested if the treatment group means were equal under the null hypothesis). The heatmap colors will only show if the log fold-change is greater than log(2) or less than log(.5). Therefore, this heatmap will only focus on comparisons with a fold change of two or greater. The met_est_heatmap function will produce an interactive heatmap using the results from the pairwise analysis.
Similar to the pairwise estimate heatmap, we will produce a heatmap
where the heatmap will only include metabolites with a significant
overall p-value, and the values in the heat map will only be colored if
the pairwise comparison is significant. We use the
met_p_heatmap function to create an interactive p-value heatmap.
################################################################################
#### Run Pairwise Comparisons ##################################################
################################################################################
strat_pairwise = metabolite_pairwise(dat,form = "GROUP_NAME*TIME1",strat_var = "Gender")
###############################################################################
## Create Estimate Heatmap #####################################################
################################################################################
met_est_heatmap(strat_pairwise$Female, dat)
################################################################################
## Create P-value Heatmap ######################################################
################################################################################
# Female
met_p_heatmap(strat_pairwise$Female, dat)
Visualizations of the data can help us see the underlying trends. Two useful visualizations are boxplots and line plots, we will be using the subpathway_boxplots and subpathway_lineplots functions to create them. The main utility of these functions is it allows you for focus on the metabolites within a subpathway. For both functions, the arguments are:
MetPipe: The analysis data
subpathway: The name of the subpathway you would like for focus on
X: The name of the variable for the X-axis
groupBy: Name of the variable to group the results by. For us, we are going to focus on the GROUP_NAME variable since we want to compare the treatment groups.
… You can pass additional arguments to filter the results. In our example, we will focus on Gender==“Female” only.
################################################################################
### BoxPlots ###################################################################
################################################################################
subpathway_boxplots(dat, subpathway = "Lactoyl Amino Acid", X=TIME1,
groupBy = GROUP_NAME, Gender =="Female")
################################################################################
## Line plots ##################################################################
################################################################################
# Set up data
dat@analysis$TIME1 <- as.numeric(factor(dat@analysis$TIME1,
levels = c("PreSymp","Onset","End")))
# Create line plots
subpathway_lineplots(dat, subpathway = "Lactoyl Amino Acid",
X= TIME1,groupBy = GROUP_NAME, Gender=="Female" )
#> `geom_smooth()` using formula = 'y ~ x'